1 Follow-along visualization practice
Visit the site below, then reproduce each graph one by one in RStudio.
[Korea R User Group – ChatGPT Data Visualization (r2bit.com)] https://r2bit.com/bitSlide/chatgpt_viz_202406.html#/데이터-시각화
library(tidyverse)
library(plotly)
library(gapminder)
library(crosstalk)
library(leaflet)
library(flipbookr)
head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
1 Use ggplot to plot life expectancy over the years for each country
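A minimal sketch of this exercise, assuming one line per country with color mapped to continent (the coloring and labels are illustrative choices, not taken from the linked slides):

```r
library(ggplot2)
library(gapminder)

# One line per country; group = country keeps each country's series separate,
# while color = continent keeps the legend to five entries.
p_life <- ggplot(gapminder,
                 aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line(alpha = 0.5) +
  labs(x = "Year", y = "Life expectancy (years)", color = "Continent")

p_life
```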
2 Pass the ggplot object to ggplotly to make an interactive graph
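One way this step might look: build a static ggplot object and hand it to `ggplotly()`, which returns an htmlwidget with hover, zoom, and pan. The plot itself is an assumed example.

```r
library(ggplot2)
library(plotly)
library(gapminder)

# Static plot first (illustrative: life expectancy by country)
p_static <- ggplot(gapminder,
                   aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line(alpha = 0.5)

# Convert the ggplot object into an interactive plotly widget
p_interactive <- ggplotly(p_static)
p_interactive
```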
4 Add a highlight feature (create a search box)
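A sketch of the highlight exercise using plotly's `highlight_key()` and `highlight()`. Setting `selectize = TRUE` adds a search/dropdown box for picking countries. Keying on `country` is an assumption about what the exercise intends.

```r
library(ggplot2)
library(plotly)
library(gapminder)

# Wrap the data so that each country becomes a selectable key
gap_key <- highlight_key(gapminder, ~country)

p_key <- ggplot(gap_key, aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.4)

# Click a line or type a country name in the selectize box to highlight it
p_highlight <- highlight(
  ggplotly(p_key),
  on = "plotly_click",
  selectize = TRUE,
  persistent = TRUE
)
p_highlight
```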
6 Linked views (zooming in on one of several graphs zooms the others as well)
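Linked zooming can be sketched with `subplot(..., shareX = TRUE)`, which puts both panels on one x-axis so that zooming either panel along x zooms the other. The choice of country and variables below is illustrative.

```r
library(plotly)
library(gapminder)

# Illustrative subset: a single country
kor <- subset(gapminder, country == "Korea, Rep.")

p_top <- plot_ly(kor, x = ~year, y = ~lifeExp,
                 type = "scatter", mode = "lines+markers", name = "lifeExp")
p_bottom <- plot_ly(kor, x = ~year, y = ~gdpPercap,
                    type = "scatter", mode = "lines+markers", name = "gdpPercap")

# shareX = TRUE links the x-axes of the stacked panels
p_linked <- subplot(p_top, p_bottom, nrows = 2, shareX = TRUE, titleY = TRUE)
p_linked
```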
7 Plot GDP per capita (gdpPercap) against life expectancy (lifeExp) by continent, for each year
(use ggplotly)
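The task wording leaves the layout open. One possible reading, sketched here, is a scatter plot colored by continent with one facet per year; the log-scaled x-axis is an added assumption to spread out low-income countries.

```r
library(ggplot2)
library(plotly)
library(gapminder)

p_facet <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.6) +
  scale_x_log10() +      # assumption: log scale for GDP per capita
  facet_wrap(~year) +    # one panel per survey year
  labs(x = "GDP per capita (log scale)", y = "Life expectancy (years)")

p_facet_interactive <- ggplotly(p_facet)
p_facet_interactive
```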
8 Plot GDP per capita (gdpPercap) against life expectancy (lifeExp) by continent, for each year (animated)
(use geom_point(aes(frame = year)))
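A sketch of the animated version. ggplot2 itself does not know the `frame` aesthetic (it warns about an unknown aesthetic), but `ggplotly()` picks it up and adds a play button and year slider. The `ids = country` mapping is an optional addition, assumed here so that points are matched across frames.

```r
library(ggplot2)
library(plotly)
library(gapminder)

p_frames <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  # frame = year drives the animation; ids = country tracks each point over time
  geom_point(aes(size = pop, frame = year, ids = country), alpha = 0.6) +
  scale_x_log10()

p_animated <- ggplotly(p_frames)
p_animated
```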
9 Reproduce the graph (1)
Use theme(legend.position = "top", axis.text.x = element_text(angle = 90, hjust = 1))
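The target graph is not reproduced in this text, so the following only illustrates the requested `theme()` call on an assumed plot (life expectancy for a handful of countries): the legend moves above the panel and the x-axis labels are rotated 90 degrees.

```r
library(ggplot2)
library(gapminder)

# Illustrative subset; the countries in the actual exercise may differ
east_asia <- subset(gapminder, country %in% c("Korea, Rep.", "Japan", "China"))

p_theme <- ggplot(east_asia, aes(x = factor(year), y = lifeExp,
                                 group = country, color = country)) +
  geom_line() +
  geom_point() +
  labs(x = NULL, y = "Life expectancy (years)", color = NULL) +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90, hjust = 1))

p_theme
```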
10 Reproduce the graph (2)
Use facet_wrap(~country, scales = "free")
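As a hedged illustration of the hint (the graph to reproduce is not shown here), `scales = "free"` gives every country panel its own x and y ranges, which helps when countries differ by orders of magnitude. The variable and countries are assumptions.

```r
library(ggplot2)
library(gapminder)

# Illustrative subset with very different GDP levels
picked <- subset(gapminder, country %in% c("Korea, Rep.", "China", "United States", "Japan"))

p_free <- ggplot(picked, aes(x = year, y = gdpPercap)) +
  geom_line() +
  facet_wrap(~country, scales = "free")  # independent axes per panel

p_free
```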
12 Implementing an animation
Write the code yourself by following along
Save as a GIF file
library(gganimate)
# Life expectancy for five countries, revealed year by year
life <- gapminder %>%
  filter(country %in% c("Korea, Rep.", "Korea, Dem. Rep.", "China", "United States", "Japan")) %>%
  ggplot(aes(x = year, y = lifeExp, group = country, col = country)) +
  geom_line(alpha = 0.3, linewidth = 1.5) +
  geom_point(size = 3.5) +
  # scale_x_date(date_breaks="1 week", date_labels="%m-%d") +
  # scale_y_continuous(labels=scales::percent) +
  theme_bw(base_family = "NanumGothic") +
  labs(x = "", y = "기대수명", color = "") +  # y label: "life expectancy"
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90, hjust = 1),
        axis.text = element_text(size = 16, color = "black"),
        legend.text = element_text(size = 18),
        plot.title = element_text(size = 22)) +
  transition_reveal(year)  # draw the lines progressively along the year axis
# Render the animation in the viewer
animate(life)
# Render at 640 x 480 and save as a GIF ("기대수명" means "life expectancy")
anim_save("기대수명.gif", animation = animate(life, width = 640, height = 480))
13 Interactive graphs
Body mass by penguin species
Penguin flipper length and body mass
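A sketch of the two penguin graphs, assuming the data come from the palmerpenguins package (the source does not name the dataset) and using plotly directly; a box plot for body mass by species is one reasonable chart type among several.

```r
library(plotly)
library(palmerpenguins)  # assumption: penguins data from this package

# Keep only the needed columns and drop rows with missing measurements
peng <- na.omit(penguins[, c("species", "flipper_length_mm", "body_mass_g")])

# Body mass by species
p_box <- plot_ly(peng, x = ~species, y = ~body_mass_g,
                 color = ~species, type = "box")

# Flipper length vs. body mass
p_scatter <- plot_ly(peng, x = ~flipper_length_mm, y = ~body_mass_g,
                     color = ~species, type = "scatter", mode = "markers")

p_box
p_scatter
```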
14 Practice with the ggiraph package
library(ggiraph)
dat <- gapminder::gapminder |>
  janitor::clean_names() |>
  mutate(
    # Reformat continent as a character instead of as a factor
    # (will be important later)
    id = levels(continent)[as.numeric(continent)],
    continent = forcats::fct_reorder(continent, life_exp)
  )
color_palette <- thematic::okabe_ito(5)
names(color_palette) <- unique(dat$continent)
base_size <- 18
mean_life_exps <- dat |>
  group_by(continent, year, id) |>
  summarise(mean_life_exp = mean(life_exp)) |>
  ungroup()
line_chart <- mean_life_exps |>
  ggplot(aes(x = year, y = mean_life_exp, col = continent)) +
  geom_line(linewidth = 2.5) +
  geom_point(size = 4) +
  theme_minimal(base_size = base_size) +
  labs(
    x = element_blank(),
    y = 'Life expectancy (in years)',
    title = 'Life expectancy over time'
  ) +
  theme(
    text = element_text(
      color = 'grey20'
    ),
    legend.position = 'none',
    panel.grid.minor = element_blank(),
    plot.title.position = 'plot'
  ) +
  scale_color_manual(values = color_palette)
line_chart
library(ggiraph)
line_chart <- mean_life_exps |>
  ggplot(aes(x = year, y = mean_life_exp, col = continent, data_id = id)) +
  geom_line_interactive(linewidth = 2.5) +
  geom_point_interactive(size = 4) +
  theme_minimal(base_size = base_size) +
  labs(
    x = element_blank(),
    y = 'Life expectancy (in years)',
    title = 'Life expectancy over time'
  ) +
  theme(
    text = element_text(
      color = 'grey20'
    ),
    legend.position = 'none',
    panel.grid.minor = element_blank(),
    plot.title.position = 'plot'
  ) +
  scale_color_manual(values = color_palette)
girafe(ggobj = line_chart)
girafe(
  ggobj = line_chart,
  options = list(
    opts_hover(css = ''), ## CSS code of line we're hovering over
    opts_hover_inv(css = "opacity:0.1;"), ## CSS code of all other lines
    opts_sizing(rescale = FALSE) ## Fixes sizes to dimensions below
  ),
  height_svg = 6,
  width_svg = 9
)
2 Machine learning (regression analysis)
Download the used-car price data from Kaggle and predict used-car prices
# Load libraries
library(dplyr)
library(caret)
library(ModelMetrics)
library(rpart)
library(randomForest)
# Load data
X_test <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/data/used_car/X_test.csv", stringsAsFactors = TRUE)
X_train <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/data/used_car/X_train.csv", stringsAsFactors = TRUE)
y_train <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/data/used_car/y_train.csv", stringsAsFactors = TRUE)
# As written, y_test is read from y_train.csv, the same file as y_train
y_test <- read.csv("https://raw.githubusercontent.com/waterfirst/Data_visualization/main/data/used_car/y_train.csv", stringsAsFactors = TRUE)
# Attach the target (price) to the features, matching rows on carID
train <- inner_join(y_train, X_train, by = "carID")
head(train)
  carID price    brand     model year transmission mileage fuelType tax  mpg engineSize
1 13207 31995   hyundi  Santa Fe 2019    Semi-Auto    4223   Diesel 145 39.8        2.2
2 17314  7700 vauxhall       GTC 2015       Manual   47870   Diesel 125 60.1        2.0
3 12342 58990     audi       RS4 2019    Automatic    5151   Petrol 145 29.1        2.9
4 13426 12999       vw  Scirocco 2016    Automatic   20423   Diesel  30 57.6        2.0
5 16004 16990    skoda     Scala 2020    Semi-Auto    3569   Petrol 145 47.1        1.0
6 18964 40890     merc   V Class 2019    Automatic    4170   Diesel 145 44.1        2.1
'data.frame': 4960 obs. of 11 variables:
$ carID : int 13207 17314 12342 13426 16004 18964 17053 19021 17429 16726 ...
$ price : int 31995 7700 58990 12999 16990 40890 25990 41980 25490 3491 ...
$ brand : Factor w/ 9 levels "audi","bmw","ford",..: 4 8 1 9 6 5 7 2 7 3 ...
$ model : Factor w/ 90 levels " 6 Series"," 7 Series",..: 69 35 63 71 70 80 55 51 16 45 ...
$ year : int 2019 2015 2019 2016 2020 2019 2020 2019 2019 2012 ...
$ transmission: Factor w/ 4 levels "Automatic","Manual",..: 4 2 1 1 4 1 1 4 1 2 ...
$ mileage : int 4223 47870 5151 20423 3569 4170 3 101 6340 85843 ...
$ fuelType : Factor w/ 5 levels "Diesel","Electric",..: 1 1 5 1 5 1 3 5 3 5 ...
$ tax : num 145 125 145 30 145 145 135 145 135 30 ...
$ mpg : num 39.8 60.1 29.1 57.6 47.1 44.1 64.2 34 52.3 57.7 ...
$ engineSize : num 2.2 2 2.9 2 1 2.1 1.8 3 2.5 1.2 ...
'data.frame': 2672 obs. of 10 variables:
$ carID : int 12000 12001 12004 12013 12017 12019 12020 12024 12027 12028 ...
$ brand : Factor w/ 9 levels "audi","bmw","ford",..: 5 9 5 6 1 8 5 5 7 9 ...
$ model : Factor w/ 89 levels " 6 Series"," 7 Series",..: 31 7 31 69 64 22 29 21 54 7 ...
$ year : int 2017 2017 2019 2019 2015 2020 2016 2015 2019 2020 ...
$ transmission: Factor w/ 4 levels "Automatic","Manual",..: 1 1 1 2 4 4 2 4 1 1 ...
$ mileage : int 12046 37683 10000 3257 20982 1950 26809 33087 4793 5000 ...
$ fuelType : Factor w/ 5 levels "Diesel","Electric",..: 1 1 1 5 5 5 1 1 3 1 ...
$ tax : num 150 260 145 145 325 145 20 165 140 265 ...
$ mpg : num 37.2 36.2 34 49.6 29.4 40.4 67.3 51.4 64.2 33.6 ...
$ engineSize : num 3 3 3 1 4 1.2 2.1 3 1.8 3 ...
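Predict used-car prices with three machine learning models, then use the best model to predict y_test and save the result as a CSV file.
Data preprocessing: remove unnecessary columns, handle NA values, check the target y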
price brand year transmission mileage fuelType tax mpg engineSize
    0     0    0            0       0        0   0   0          0
brand year transmission mileage fuelType tax mpg engineSize
    0    0            0       0        0   0   0          0
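The code behind these NA counts is not shown. A minimal sketch of the step follows, assuming `carID` and `model` are the dropped columns (they are the two missing from the output) and using a small made-up data frame in place of the downloaded files:

```r
library(dplyr)

# Hypothetical stand-in for the joined training data (values are made up)
set.seed(1)
n <- 50
train <- data.frame(
  carID   = 1:n,
  price   = round(runif(n, 3000, 60000)),
  brand   = factor(sample(c("audi", "bmw", "ford"), n, replace = TRUE)),
  model   = factor(sample(c("A1", "X5", "Focus"), n, replace = TRUE)),
  year    = sample(2012:2020, n, replace = TRUE),
  mileage = round(runif(n, 0, 90000))
)

# Drop the identifier and the high-cardinality model column
train <- train |> select(-carID, -model)

# Count missing values per column, then inspect the target
na_counts <- colSums(is.na(train))
na_counts
summary(train$price)
```

The same `select()` and `colSums(is.na(...))` calls would apply to `X_test`, which has no `price` column.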
Split into training and validation data
   price    brand year transmission mileage fuelType tax  mpg engineSize
1  31995   hyundi 2019    Semi-Auto    4223   Diesel 145 39.8        2.2
2   7700 vauxhall 2015       Manual   47870   Diesel 125 60.1        2.0
4  12999       vw 2016    Automatic   20423   Diesel  30 57.6        2.0
6  40890     merc 2019    Automatic    4170   Diesel 145 44.1        2.1
8  41980      bmw 2019    Semi-Auto     101   Petrol 145 34.0        3.0
11 13750 vauxhall 2016       Manual   40230   Diesel 160 50.4        1.6
   price  brand year transmission mileage fuelType tax  mpg engineSize
3  58990   audi 2019    Automatic    5151   Petrol 145 29.1        2.9
5  16990  skoda 2020    Semi-Auto    3569   Petrol 145 47.1        1.0
7  25990 toyota 2020    Automatic       3   Hybrid 135 64.2        1.8
9  25490 toyota 2019    Automatic    6340   Hybrid 135 52.3        2.5
10  3491   ford 2012       Manual   85843   Petrol  30 57.7        1.2
13 46050     vw 2019    Automatic    5000   Diesel 145 33.6        2.0
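The split code is not shown either. The 3473 training rows reported later, out of 4960, are consistent with a 70/30 split, so the sketch below assumes `caret::createDataPartition()` with `p = 0.7`. The names `train_df` and `valid_df`, the seed, and the toy data are assumptions.

```r
library(caret)

# Hypothetical stand-in for the preprocessed training data
set.seed(123)
n <- 200
train <- data.frame(
  price   = round(runif(n, 3000, 60000)),
  year    = sample(2012:2020, n, replace = TRUE),
  mileage = round(runif(n, 0, 90000))
)

# Sample about 70% of the row indices, stratified on quantiles of price
idx <- createDataPartition(train$price, p = 0.7, list = FALSE)

train_df <- train[idx, ]   # used to fit the models
valid_df <- train[-idx, ]  # held out to compare the models

head(train_df)
head(valid_df)
```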
Build the models (rpart, glm, randomForest)
CART
3473 samples
8 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3473, 3473, 3473, 3473, 3473, 3473, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.1015714 11374.56 0.5138065 7875.244
0.1293171 12458.48 0.4181405 8767.174
0.3473557 14623.91 0.3369765 10520.659
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.1015714.
Generalized Linear Model
3473 samples
8 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3473, 3473, 3473, 3473, 3473, 3473, ...
Resampling results:
RMSE Rsquared MAE
8880.687 0.7037728 5795.044
Call:
randomForest(formula = price ~ ., data = train_df, ntree = 10)
Type of random forest: regression
Number of trees: 10
No. of variables tried at each split: 2
Mean of squared residuals: 28715556
% Var explained: 89.36
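The printed summaries above suggest two `caret::train()` fits with default bootstrap resampling (methods `"rpart"` and `"glm"`) plus a direct `randomForest()` call with `ntree = 10`. A sketch along those lines follows, on made-up data; the object names and the data-generating formula are assumptions.

```r
library(caret)
library(randomForest)

# Hypothetical training data in which price depends on the predictors
set.seed(42)
n <- 300
train_df <- data.frame(
  year       = sample(2012:2020, n, replace = TRUE),
  mileage    = round(runif(n, 0, 90000)),
  engineSize = sample(c(1.0, 1.6, 2.0, 3.0), n, replace = TRUE)
)
train_df$price <- 10000 + 3000 * (train_df$year - 2012) -
  0.1 * train_df$mileage + 4000 * train_df$engineSize + rnorm(n, sd = 2000)

# 1) Decision tree (CART), tuned over cp by caret's default bootstrap
m_rpart <- train(price ~ ., data = train_df, method = "rpart")

# 2) Generalized linear model; with the default Gaussian family this is
#    ordinary linear regression
m_glm <- train(price ~ ., data = train_df, method = "glm")

# 3) Random forest with 10 trees, as in the printed call above
m_rf <- randomForest(price ~ ., data = train_df, ntree = 10)

m_rpart
m_glm
m_rf
```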
Validate model performance (R2)
GLM (linear regression) model R2 = 0.7377449
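The source does not show which R2 definition produced this number. As an illustration, the coefficient of determination, 1 - SSE/SST, can be computed from held-out observations and predictions in base R. The helper `r2()` and the four data points are hypothetical.

```r
# Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
r2 <- function(obs, pred) {
  sse <- sum((obs - pred)^2)
  sst <- sum((obs - mean(obs))^2)
  1 - sse / sst
}

# Tiny worked example with made-up values
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 37)

# SSE = 4 + 4 + 9 + 9 = 26, SST = 225 + 25 + 25 + 225 = 500
r2(obs, pred)  # 1 - 26 / 500 = 0.948
```

In the exercise, `obs` would be the validation prices and `pred` the output of `predict()` for each fitted model on the validation rows, so the three models can be ranked on the same held-out data.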
Fit the final model and predict
Call:
randomForest(formula = price ~ ., data = train, ntree = 100)
Type of random forest: regression
Number of trees: 100
No. of variables tried at each split: 2
Mean of squared residuals: 15599489
% Var explained: 94.2
Submit the predictions
x
1 38996.07
2 24447.39
3 57229.13
4 16062.26
5 48071.38
6 21900.47
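A sketch of the last two steps, assuming the final model is a random forest refit on the full training data with `ntree = 100` (as the printed call indicates) and that the predictions are written with `write.csv()`. Writing a bare numeric vector this way yields a single column named `x`, which matches the output shown. The data and the file name `submission.csv` are hypothetical.

```r
library(randomForest)

# Hypothetical stand-ins for the full training set and the test features
set.seed(7)
make_cars <- function(n) {
  data.frame(
    year    = sample(2012:2020, n, replace = TRUE),
    mileage = round(runif(n, 0, 90000))
  )
}
train  <- make_cars(300)
train$price <- 10000 + 3000 * (train$year - 2012) - 0.1 * train$mileage +
  rnorm(300, sd = 2000)
X_test <- make_cars(40)

# Refit on all training rows, then predict prices for the test features
final_model <- randomForest(price ~ ., data = train, ntree = 100)
pred <- predict(final_model, newdata = X_test)

# Save the predictions; the output path is a placeholder
out_file <- file.path(tempdir(), "submission.csv")
write.csv(pred, out_file, row.names = FALSE)

head(read.csv(out_file))
```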